Abstract

Computational mechanics is a method for discovering, describing and quantifying patterns, using tools from statistical physics. It constructs optimal, minimal models of stochastic processes and their underlying causal structures. These models tell us about the intrinsic computation embedded within a process: how it stores and transforms information. Here we summarize the mathematics of computational mechanics, especially recent optimality and uniqueness results. We also expound the principles and motivations underlying computational mechanics, emphasizing its connections to the minimum description length principle, PAC theory, and other aspects of machine learning.

1 Introduction

All students of machine learning are familiar with pattern recognition; in this paper we wish to introduce a new term for a related, relatively under-recognized concept, pattern discovery, and a way of tackling such problems, computational mechanics.

The term pattern discovery is meant to contrast with both pattern recognition and pattern learning. In pattern recognition, the goal of the system is to accurately assign inputs to pre-set categories. In most learning systems, the goal is to determine which of several pre-set categorization schemes is correct. (Naturally, the two tasks are closely connected (Vapnik 1995).) In either case, the representations used have been, as it were, handed down from on high, due to choices external to the recognition and learning procedures.

In pattern discovery, however, the aim is to avoid, so far as possible, such a priori assumptions about what structures are relevant. This is, of course, an ancient problem, and one which has not been ignored in machine learning. While there are ingenious schemes for pattern discovery via trial and error, some even informed by empirical psychology (Holland, Holyoak, Nisbett, and Thagard 1986), we believe that a more direct approach is not only possible but also illuminates the ideal results and the limitations of all pattern-discovery methods.

Computational mechanics originated in physics as a complementary approach to statistical mechanics for dealing with complex, organized systems (Crutchfield 1994). In such systems the "forward" approach of statistical mechanics (deriving macroscopic properties from the interactions of microscopic components) is often intractable, though data can be had in abundance.^1 Computational mechanics follows an "inverse" strategy, extending the idea of extracting "geometry from a time series" (Packard, Crutchfield, Farmer, and Shaw 1980). It builds the simplest model capable of capturing the patterns in the data: a representation of the causal structure of the hidden process which generated the observed behavior.^2 In a sense that will be made clear as we go on, this representation, the ε-machine, is the unique maximally efficient model of the observed data-generating process. The basic ideas of computational mechanics were introduced in Crutchfield and Young (1989). Since then they have been used to analyze dynamical systems, cellular automata, hidden Markov models, evolved spatial computation, stochastic resonance, globally coupled maps, and the dripping faucet experiment; see Shalizi and Crutchfield (1999, Sec. 1) for references.

This paper is arranged as follows. First we examine some conceptual issues about pattern discovery and the way they are addressed by computational mechanics. We devote the bulk of this paper to a summary of the mathematical structure of computational mechanics, with particular attention to optimality and uniqueness theorems. Results are stated without proof; readers will find a full treatment in Shalizi and Crutchfield (1999). Then we discuss the ties between computational mechanics and several approaches to machine learning. Finally, we close by pointing out directions for future theoretical work.

^1 But see Chaikin and Lubensky (1995) and Cross and Hohenberg (1993) for organized systems where the "forward" approach works.

^2 See Feldman and Crutchfield (1998) for an example of using both statistical and computational mechanics to analyze the same physical system.

2 Conceptual Issues

Any approach to handling patterns should, we claim, meet a number of criteria, the justifications for which are given in detail in Shalizi and Crutchfield (1999). It should be at once

1. Predictive, i.e., the models it produces should allow us to predict the original process or system we are trying to understand and, by that token, provide a compressed description of it;

2. Computational, showing how the process 
stores, transmits, and transforms infor- 
mation; 

3. Calculable, analytically or by systematic 
approximation; 

4. Causal, telling us how instances of the 
pattern are actually produced; and 

5. Naturally stochastic, not merely tolerant 
of noise but explicitly formulated in terms 
of ensembles. 

In any modeling approach, the two (related) problems are to devise a mapping from states of the world (or, more modestly, states of inputs) to states of the model, and to accurately and precisely predict future states of the world on the basis of the evolution of the model. (Cf. Holland, Holyoak, Nisbett, and Thagard (1986) on "q-morphisms".) The key idea of computational mechanics is that the information required to do this is actually in the data, provided there is enough of it. In fact, if we go about it right, the key step is getting the mapping from data to model states right; equivalently, the problem is to decide which data-sets should be treated as equivalent and how data should be partitioned. Once we have the correct mapping of data into equivalence classes, accurate prediction is actually fairly simple. That the correct mapping should treat as equivalent all data-sets which leave us in the same degree of knowledge about the future has a certain intuitive plausibility, but also sounds hopelessly vague. In fact, we can specify such a partition in a precise, operational way, show that it is the best one to use, and determine it empirically. We call the function which induces that partition ε, and its equivalence classes causal states. The model we get from using such a partition, the ε-machine, meets all the criteria stated above. It is because the ε-machine shows, in a very direct way, how information is stored in the process, and how that stored information is transformed by new inputs and by the passage of time, that computational mechanics is about computation.



3 Mathematical Development

3.1 Note on Information Theory 

The bulk of the following development relies on notions and results from information theory. We follow the standard definitions and notation of Cover and Thomas (1991), to which we refer readers unfamiliar with the theory. In particular, $H[X]$ is the entropy of the discrete random variable $X$, interpreted as the uncertainty in $X$, measured in bits.^3 $H[X|Y]$ is the entropy of $X$ conditional on $Y$, and $I[X;Y]$ the mutual information between the two random variables.
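
For readers who prefer a computational view, the three quantities can be evaluated directly for a small joint distribution. The Python sketch below is an illustration only: the function names and the example distribution are assumptions made here, not part of the development in Cover and Thomas (1991).

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal of a joint distribution over (x, y) pairs along one axis."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_entropy(joint):
    """H[X|Y] = H[X, Y] - H[Y]."""
    return entropy(joint) - entropy(marginal(joint, 1))

def mutual_information(joint):
    """I[X;Y] = H[X] - H[X|Y]."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)

# Hypothetical joint distribution P(X = x, Y = y), chosen only for illustration.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
h_x = entropy(marginal(joint, 0))
h_x_given_y = conditional_entropy(joint)
i_xy = mutual_information(joint)
```

In this example knowing $Y$ removes part, but not all, of the uncertainty in $X$, so the mutual information is strictly between zero and $H[X]$.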



3.2 Hidden Processes 

We restrict ourselves to discrete-valued, discrete-time stationary stochastic processes. (See Sec. 5.1 for discussion of these restrictions.) Intuitively, such processes are sequences of random variables $S_i$, the values of which are drawn from a countable set $\mathcal{A}$. We let $i$ range over all the integers, and so get a bi-infinite sequence $\overleftrightarrow{S} = \ldots S_{-1} S_0 S_1 \ldots$. In fact, we define a process in terms of the distribution of such sequences.

^3 Here, and throughout, we follow the convention of using capital letters to denote random variables and lower-case letters their particular values.

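Given that $\overleftrightarrow{S}$ is well-defined, there are probability distributions for sequences of every finite length. Let $\overrightarrow{S}_t^L$ be the sequence $S_t, S_{t+1}, \ldots, S_{t+L-1}$ of $L$ random variables beginning at $S_t$. $\overrightarrow{S}_t^0 \equiv \lambda$, the null sequence. Likewise, $\overleftarrow{S}_t^L$ denotes the sequence of $L$ random variables going up to $S_t$, but not including it: $\overleftarrow{S}_t^L = \overrightarrow{S}_{t-L}^L$. Both $\overrightarrow{S}_t^L$ and $\overleftarrow{S}_t^L$ take values $s^L \in \mathcal{A}^L$. Similarly, $\overrightarrow{S}_t$ and $\overleftarrow{S}_t$ are the semi-infinite sequences starting from and stopping at $t$ and taking values $\overrightarrow{s}$ and $\overleftarrow{s}$, respectively.

Requiring the process $S_i$ to be stationary means that

$P(\overrightarrow{S}_t^L = s^L) = P(\overrightarrow{S}_0^L = s^L)$ , (1)

for all $t \in \mathbb{Z}$, $L \in \mathbb{Z}^+$, and all $s^L \in \mathcal{A}^L$. (A stationary process is one that is time-translation invariant.) Consequently, $P(\overrightarrow{S}_t = \overrightarrow{s}) = P(\overrightarrow{S}_0 = \overrightarrow{s})$ and $P(\overleftarrow{S}_t = \overleftarrow{s}) = P(\overleftarrow{S}_0 = \overleftarrow{s})$, and so the subscripts may be dropped.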
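
In practice the block distributions of Eq. (1) are estimated from a single sample path. The following Python sketch illustrates this under assumptions of our own: the example process (a binary sequence in which a 0 is always followed by a 1), the sample size, and all function names are hypothetical choices. Comparing block frequencies from two time windows is only a crude diagnostic of time-translation invariance, not a test of stationarity.

```python
import random
from collections import Counter

def golden_mean_sample(n, seed=0):
    """Sample path of a binary process in which a 0 is always followed by a 1."""
    rng = random.Random(seed)
    out, prev = [], 1
    for _ in range(n):
        # After a 0 the next symbol is 1; after a 1 it is 0 or 1 with equal probability.
        sym = 1 if prev == 0 else rng.randint(0, 1)
        out.append(sym)
        prev = sym
    return out

def block_distribution(seq, L, start, stop):
    """Empirical distribution of length-L blocks beginning at times start..stop-1."""
    counts = Counter(tuple(seq[t:t + L]) for t in range(start, stop))
    total = sum(counts.values())
    return {block: c / total for block, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

seq = golden_mean_sample(20000)
L = 2
early = block_distribution(seq, L, 0, 10000)
late = block_distribution(seq, L, 10000, len(seq) - L + 1)
gap = total_variation(early, late)
```

For a stationary process the two estimates should agree up to sampling error, so `gap` is expected to be small.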

3.3 Effective States 

Our goal is to predict all or part of $\overrightarrow{S}$ using some function of some part of $\overleftarrow{S}$. We begin by taking the set $\overleftarrow{\mathbf{S}}$ of all pasts and partitioning it into mutually exclusive and jointly comprehensive subsets. That is, we make a class $\mathcal{R}$ of subsets of pasts. (See Fig. 1.) Each $\rho \in \mathcal{R}$ will be called a state or an effective state. When the current history $\overleftarrow{s}$ is included in the set $\rho$, we will say the process is in state $\rho$. Thus, there is a function from histories to effective states:

$\eta : \overleftarrow{\mathbf{S}} \mapsto \mathcal{R}$ . (2)

An individual history $\overleftarrow{s} \in \overleftarrow{\mathbf{S}}$ maps to a specific state $\rho \in \mathcal{R}$; the random variable $\overleftarrow{S}$ for the past maps to the random variable $\mathcal{R}$ for the effective states.

Figure 1: A schematic picture of a partition of the set $\overleftarrow{\mathbf{S}}$ of all histories into some class of effective states: $\mathcal{R} = \{\mathcal{R}_i : i = 1, 2, 3, 4\}$. Note that the $\mathcal{R}_i$ need not form compact sets; we simply draw them that way for clarity. One should have in mind Cantor sets or other more pathological structures.

Any function defined on $\overleftarrow{\mathbf{S}}$ will serve to partition that set: we just assign to the same $\rho$ all the histories $\overleftarrow{s}$ on which the function takes the same value. (Similarly, any equivalence relation on $\overleftarrow{\mathbf{S}}$ partitions it.) Each effective state has a well-defined conditional distribution of futures, though not necessarily a unique one. Specifying the effective state thus amounts to making a prediction about the process's future. In this way, the framework formally incorporates traditional methods of time-series analysis.
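
To make the role of $\eta$ concrete, the sketch below partitions finite histories by an arbitrary function and estimates each effective state's distribution over futures. It is a minimal Python illustration under our own assumptions: histories and futures are truncated to finite lengths, and the names `effective_states` and `last_symbol`, as well as the toy sequence, are hypothetical.

```python
from collections import Counter, defaultdict

def effective_states(seq, eta, history_len, future_len):
    """Group histories by eta and estimate each state's distribution over futures."""
    futures = defaultdict(Counter)
    for t in range(history_len, len(seq) - future_len + 1):
        history = tuple(seq[t - history_len:t])
        future = tuple(seq[t:t + future_len])
        # All histories with the same value of eta fall into the same effective state.
        futures[eta(history)][future] += 1
    return {
        state: {f: c / sum(counts.values()) for f, c in counts.items()}
        for state, counts in futures.items()
    }

def last_symbol(history):
    """One possible eta: keep only the most recent symbol of the history."""
    return history[-1]

# Toy observation sequence, chosen only for illustration.
seq = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
states = effective_states(seq, last_symbol, history_len=3, future_len=1)
```

Here the choice of `last_symbol` plays the part of a traditional first-order Markov model: the effective state is the last observed symbol, and the prediction attached to it is the estimated distribution of the next symbol.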
3.4 Patterns in Ensembles

It will be convenient to have a way of talking about the uncertainty of the future. We do not want to use $H[\overrightarrow{S}]$, since that is infinite in general. Instead, we will work with $H[\overrightarrow{S}^L]$, the uncertainty of the next $L$ symbols, treated as a function of $L$.

Definition 1 (Capturing a Pattern) $\mathcal{R}$ captures a pattern iff there exists an $L$ such that

$H[\overrightarrow{S}^L | \mathcal{R}] < L H[S]$ . (3)

$\mathcal{R}$ captures a pattern when it tells us something about how the distinguishable parts of a process affect each other: $\mathcal{R}$ exhibits their dependence. (We also speak of $\eta$ as capturing a pattern.) The smaller $H[\overrightarrow{S}^L | \mathcal{R}]$, the stronger the pattern captured by $\mathcal{R}$. Our first result bounds how strongly $\mathcal{R}$ can capture a process's pattern.

Lemma 1 For all $\mathcal{R}$ and for all $L \in \mathbb{Z}^+$,

$H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \overleftarrow{S}]$ . (4)

3.5 Minimality and Prediction

Let's invoke Occam's Razor: "It is vain to do with more what can be done with less". To use the razor, we have to fix what is to be "done" and what "more" and "less" mean. The job we want done is accurate prediction; i.e., to reduce the conditional entropies $H[\overrightarrow{S}^L | \mathcal{R}]$ as far as possible, down to the bound set by Lemma 1. But we want to do this as simply as possible, with as few resources as possible. To meet both constraints (minimal uncertainty and minimal resources) we will need a measure of the second. Since $P(\overleftarrow{S} = \overleftarrow{s})$ is well defined, it induces a probability distribution on the $\eta$-states, and we can set up the following measure of resources.

Definition 2 (Complexity of State Classes) The statistical complexity of a class $\mathcal{R}$ of states is

$C_\mu(\mathcal{R}) \equiv H[\mathcal{R}]$ (5)

$= -\sum_{\rho \in \mathcal{R}} P(\mathcal{R} = \rho) \log_2 P(\mathcal{R} = \rho)$ ,

when the sum converges to a finite value.

$C_\mu(\mathcal{R})$ is the average uncertainty in the process's current state $\mathcal{R}$. This is the same as the average amount of memory (in bits) that the process appears to retain about the past, given the chosen state class $\mathcal{R}$. We wish to do with as little of this memory as possible. Our objective, then, is to find a state class which minimizes $C_\mu$, subject to the constraint of maximally accurate prediction.

3.6 Causal States

Definition 3 (A Process's Causal States) The causal states of a process are the members of the range of the function $\epsilon : \overleftarrow{\mathbf{S}} \mapsto 2^{\overleftarrow{\mathbf{S}}}$ (the power set of $\overleftarrow{\mathbf{S}}$):

$\epsilon(\overleftarrow{s}) \equiv \{ \overleftarrow{s}' \mid P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}')$, for all $\overrightarrow{s} \in \overrightarrow{\mathbf{S}}$, $\overleftarrow{s}' \in \overleftarrow{\mathbf{S}} \}$ , (6)

that maps from histories to classes of histories. We write the $i$th causal state as $\mathcal{S}_i$ and the set of all causal states as $\boldsymbol{\mathcal{S}}$; the corresponding random variable is denoted $\mathcal{S}$ and its realization $\sigma$.

The cardinality of $\boldsymbol{\mathcal{S}}$ is unrestricted. $\boldsymbol{\mathcal{S}}$ can be finite, countably infinite, a continuum, a Cantor set, or something stranger still.^4

^4 Examples of all of these are given in Crutchfield (1994) and Upper (1997).
We could equally well define an equivalence relation $\sim_\epsilon$ such that two histories are equivalent iff they have the same conditional distribution of futures, and define causal states as the equivalence classes of $\sim_\epsilon$. (In fact, this was the original approach of Crutchfield and Young (1989).) Either way, we break $\overleftarrow{\mathbf{S}}$ into parts that leave us in different conditions of ignorance about the future.

Each causal state $\mathcal{S}_i$ has a conditional distribution of futures, $P(\overrightarrow{S} | \mathcal{S}_i)$. It follows directly from Def. 3 that no two states have the same distribution of futures; this is not true of effective states in general. Another immediate consequence of that definition is that

$P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \epsilon(\overleftarrow{s})) = P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s})$ . (7)

Again, this is not generally true of effective states.
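
The equivalence relation can be imitated numerically by merging histories whose estimated future distributions agree. The Python sketch below is a crude finite-data approximation, not one of the published reconstruction algorithms: truncating histories to length `k` and futures to length `L`, the greedy merging rule, the tolerance `tol`, and the example process are all assumptions made for illustration. It also evaluates the statistical complexity of Definition 2 on the resulting states.

```python
import math
import random
from collections import Counter, defaultdict

def golden_mean_sample(n, seed=0):
    """Sample path of a binary process in which a 0 is always followed by a 1."""
    rng = random.Random(seed)
    out, prev = [], 1
    for _ in range(n):
        sym = 1 if prev == 0 else rng.randint(0, 1)
        out.append(sym)
        prev = sym
    return out

def history_statistics(seq, k, L):
    """Empirical P(history) and P(next L symbols | history) for length-k histories."""
    counts = defaultdict(Counter)
    for t in range(k, len(seq) - L + 1):
        counts[tuple(seq[t - k:t])][tuple(seq[t:t + L])] += 1
    total = sum(sum(c.values()) for c in counts.values())
    prob = {h: sum(c.values()) / total for h, c in counts.items()}
    cond = {h: {f: m / sum(c.values()) for f, m in c.items()} for h, c in counts.items()}
    return prob, cond

def merge_histories(cond, tol):
    """Greedily group histories whose future distributions are within tol in total variation."""
    groups = []  # each entry: (representative distribution, member histories)
    for history, dist in sorted(cond.items()):
        for rep, members in groups:
            keys = set(rep) | set(dist)
            tv = 0.5 * sum(abs(rep.get(f, 0.0) - dist.get(f, 0.0)) for f in keys)
            if tv < tol:
                members.append(history)
                break
        else:
            groups.append((dist, [history]))
    return [members for _, members in groups]

seq = golden_mean_sample(20000)
prob, cond = history_statistics(seq, k=2, L=1)
states = merge_histories(cond, tol=0.05)
state_prob = [sum(prob[h] for h in members) for members in states]
c_mu = -sum(p * math.log2(p) for p in state_prob if p > 0)
```

For this example the histories ending in 1 are merged into one state and the history ending in 0 forms another, so two approximate causal states are recovered. With finite data the merging depends on `tol`, which is why a statistical theory of reconstruction error matters.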



3.7 Causal State-to-State Transitions

The causal state at any given time and the next value of the observed process together determine a new causal state (Lemma 2 below, which doesn't rely on the following). Thus, there is a natural relation of succession among the causal states.

Definition 4 (Causal Transitions) The labeled transition probability $T_{ij}^{(s)}$ is the probability of making the transition from state $\mathcal{S}_i$ to state $\mathcal{S}_j$ while emitting the symbol $s \in \mathcal{A}$:

$T_{ij}^{(s)} \equiv P(\mathcal{S}' = \mathcal{S}_j, \overrightarrow{S}^1 = s | \mathcal{S} = \mathcal{S}_i)$ , (8)

where $\mathcal{S}$ is the current causal state and $\mathcal{S}'$ its successor on emitting $s$. We denote the set $\{T_{ij}^{(s)} : s \in \mathcal{A}\}$ by $\mathbf{T}$.

3.8 ε-Machines

Definition 5 (An ε-Machine Defined) The ε-machine of a process is the ordered pair $\{\epsilon, \mathbf{T}\}$, where $\epsilon$ is the causal state function and $\mathbf{T}$ is the set of the transition matrices for the states defined by $\epsilon$.

Lemma 2 (ε-Machines Are Deterministic) For each $\mathcal{S}_i$ and $s \in \mathcal{A}$, $T_{ij}^{(s)} > 0$ only for that $\mathcal{S}_j$ for which $\epsilon(\overleftarrow{s} s) = \mathcal{S}_j$ iff $\epsilon(\overleftarrow{s}) = \mathcal{S}_i$, for all pasts $\overleftarrow{s}$.

"Deterministic" is meant in the sense of automata theory, not dynamics.

Lemma 3 (Causal States Are Independent) The probability distributions over causal states at different times are conditionally independent.

This indicates that the causal states, considered as a process, define a kind of Markov chain. We say "kind of" since the class of ε-machines is substantially richer than the one normally associated with Markov chains (Crutchfield 1994; Upper 1997).
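
As a concrete instance of Definition 4 and Lemma 2, labeled transition matrices can be stored as one matrix per symbol. The Python sketch below uses a hypothetical two-state machine (a binary process in which a 0 is always followed by a 1); the state labels, the numerical values, and the helper names are assumptions made for illustration.

```python
# Labeled transition matrices T[s][i][j] for a hypothetical two-state example.
# States: index 0 is "A", index 1 is "B"; symbols: 0 and 1.
T = {
    0: [[0.0, 0.5], [0.0, 0.0]],  # emitting 0: A -> B with probability 1/2
    1: [[0.5, 0.0], [1.0, 0.0]],  # emitting 1: A -> A with probability 1/2, B -> A with probability 1
}

def is_deterministic(T):
    """True if every (state, symbol) pair has at most one successor with positive probability."""
    return all(sum(1 for p in row if p > 0) <= 1 for mat in T.values() for row in mat)

def state_transition_matrix(T):
    """Sum the labeled matrices over symbols to get the state-to-state chain."""
    n = len(next(iter(T.values())))
    return [[sum(T[s][i][j] for s in T) for j in range(n)] for i in range(n)]

def stationary_distribution(M, iterations=200):
    """Power iteration from the uniform distribution; assumes the chain is ergodic."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
    return pi

M = state_transition_matrix(T)
pi = stationary_distribution(M)
```

The check in `is_deterministic` expresses determinism in the automata-theoretic sense: the current state and the emitted symbol fix the next state, even though the symbol itself is random.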



Definition 6 (ε-Machine Reconstruction) ε-Machine reconstruction is any procedure that, given a process $P(\overleftrightarrow{S})$, or an approximation of $P(\overleftrightarrow{S})$, produces the process's ε-machine $\{\epsilon, \mathbf{T}\}$.

Given a mathematical description of a process, one can often calculate its ε-machine analytically. (For example, see the computational mechanics analysis of statistical mechanical spin systems in Feldman and Crutchfield (1998).) There are also algorithms that reconstruct ε-machines from empirical estimates of $P(\overleftrightarrow{S})$. Those used in Crutchfield (1994), Crutchfield and Young (1989), Hanson (1993), and Perry and Binder (1999) operate in batch mode, taking the raw data as a whole and producing the ε-machine. Others could work on-line, taking in individual measurements and re-estimating the set of causal states and their transition probabilities.

3.9 Optimalities and Uniqueness

Theorem 1 (Causal States Are Maximally Prescient) For all $\mathcal{R}$ and all $L \in \mathbb{Z}^+$,

$H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \mathcal{S}]$ . (9)

Causal states are as good at predicting the future (are as prescient) as complete histories. Since the causal states can be systematically approximated, we have shown that the upper bound on the strength of patterns (Def. 1 and Lemma 1) can in fact be reached.

All subsequent results concern rival states that are as prescient as the causal states. We call these prescient rivals and denote a class of them $\hat{\mathcal{R}}$.

Definition 7 (Prescient Rivals) Prescient rivals $\hat{\mathcal{R}}$ are states that are as predictive as the causal states; viz., for all $L \in \mathbb{Z}^+$,

$H[\overrightarrow{S}^L | \hat{\mathcal{R}}] = H[\overrightarrow{S}^L | \mathcal{S}]$ . (10)

Lemma 4 (Refinement Lemma) For all prescient rivals $\hat{\mathcal{R}}$ and for each $\hat{\rho} \in \hat{\mathcal{R}}$, there is a $\sigma \in \boldsymbol{\mathcal{S}}$ and a measure-0 subset $\hat{\rho}_0 \subset \hat{\rho}$, possibly empty, such that $\hat{\rho} \setminus \hat{\rho}_0 \subseteq \sigma$, where $\setminus$ is set subtraction.

The lemma becomes more intuitive if we ignore for a moment the measure-0 set $\hat{\rho}_0$ of histories. It then says that any alternative partition $\hat{\mathcal{R}}$ that is as prescient as the causal states must be a refinement of the causal-state partition. That is, each $\hat{\mathcal{R}}_i$ must be a (possibly improper) subset of some $\mathcal{S}_j$. Otherwise, at least one $\hat{\mathcal{R}}_i$ would contain parts of at least two causal states. Therefore, using $\hat{\mathcal{R}}_i$ to predict the future observables would lead to more uncertainty about $\overrightarrow{S}$ than using the causal states. (Compare Fig. 3 with Fig. 2.) Because the histories in $\hat{\rho}_0$ have zero probability, treating them the "wrong" way makes no discernible difference to predictions.

Figure 2: An alternative class $\mathcal{R}$ of states (delineated by dashed lines) that partition $\overleftarrow{\mathbf{S}}$, overlaid on the causal states $\boldsymbol{\mathcal{S}}$ (solid lines). Here, for example, $\mathcal{S}_2$ contains parts of $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$ and $\mathcal{R}_4$. Note again that the $\mathcal{R}_i$ need not be compact nor simply connected, as drawn.

Theorem 2 (Causal States Are Minimal) For all prescient rivals $\hat{\mathcal{R}}$,

$C_\mu(\hat{\mathcal{R}}) \geq C_\mu(\boldsymbol{\mathcal{S}})$ . (11)

If we were trying to predict, not the whole of $\overrightarrow{S}$ but some limited piece $\overrightarrow{S}^L$, the causal states might not be the simplest ones with full predictive power. For any value of $L$, however, the states constructed by analogy to the causal states (the "truncated causal states") have maximal prescience and minimal $C_\mu$.

Figure 3: A prescient rival partition $\hat{\mathcal{R}}$ must be a refinement of the causal-state partition almost everywhere. Almost all of each $\hat{\mathcal{R}}_i$ must lie within some $\mathcal{S}_j$; the exceptions, if any, are a set of histories of measure 0. Here, for instance, $\mathcal{S}_2$ contains the positive-measure parts of three rival states, among them $\hat{\mathcal{R}}_3$ and $\hat{\mathcal{R}}_4$. One of these rival states, say $\hat{\mathcal{R}}_3$, could have member-histories in any or all of the other causal states, if the total measure of these exceptional histories is zero.

The minimality theorem licenses the following definition.

Definition 8 (Statistical Complexity of a Process) The statistical complexity $C_\mu(\mathcal{O})$ of a process $\mathcal{O}$ is that of its causal states: $C_\mu(\mathcal{O}) \equiv C_\mu(\boldsymbol{\mathcal{S}})$.

Theorem 3 (Causal States Are Unique) For all prescient rivals $\hat{\mathcal{R}}$, if $C_\mu(\hat{\mathcal{R}}) = C_\mu(\boldsymbol{\mathcal{S}})$, then there exists an invertible function between $\hat{\mathcal{R}}$ and $\boldsymbol{\mathcal{S}}$ that almost always preserves equivalence of state: $\hat{\mathcal{R}}$ and $\hat{\eta}$ are the same as $\boldsymbol{\mathcal{S}}$ and $\epsilon$, respectively, except on a set of histories of measure 0.

The remarks on Lemma 4 also apply to the ineliminable but immaterial measure-0 caveat here.

Theorem 4 (ε-Machines Are Minimally Stochastic) For all prescient rivals $\hat{\mathcal{R}}$,

$H[\hat{\mathcal{R}}' | \hat{\mathcal{R}}] \geq H[\mathcal{S}' | \mathcal{S}]$ , (12)

where $\mathcal{S}'$ and $\hat{\mathcal{R}}'$ are the next causal state of the process and the next $\hat{\eta}$-state, respectively.

Finally, we relate $C_\mu$ to an information-theoretic quantity that is often used to measure complexity.

Definition 9 (Excess Entropy) The excess entropy $\mathbf{E}$ of a process is the mutual information between its semi-infinite past and its semi-infinite future:

$\mathbf{E} \equiv I[\overrightarrow{S}; \overleftarrow{S}]$ . (13)

Excess entropy is regularly re-introduced into the complexity-measure literature, as "predictive information", "stored information", "effective measure complexity", and so on (Shalizi and Crutchfield 1999, Sec. VI). As these names indicate, it is tempting to see $\mathbf{E}$ as the amount of information stored in a process (which accounts for its popularity). According to the following theorem this temptation should be resisted.

Theorem 5 The statistical complexity $C_\mu$ bounds the excess entropy $\mathbf{E}$:

$\mathbf{E} \leq C_\mu(\boldsymbol{\mathcal{S}})$ , (14)

with equality iff $H[\mathcal{S} | \overrightarrow{S}] = 0$.

$\mathbf{E}$ is thus only a lower bound on the true amount of information stored in the process, namely $C_\mu(\boldsymbol{\mathcal{S}})$.
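
The gap between the two quantities can be seen numerically in a small example. The Python sketch below uses a hypothetical two-state machine (a binary process in which a 0 is always followed by a 1) with its stationary state distribution `pi` supplied by hand; these values and the function names are assumptions made for illustration. Because the observed symbols of this particular example form a first-order Markov chain, the past-future mutual information reduces to the mutual information between two adjacent symbols; that shortcut does not hold for processes in general.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical labeled transition matrices T[s][i][j]; states 0 ("A") and 1 ("B").
T = {
    0: [[0.0, 0.5], [0.0, 0.0]],
    1: [[0.5, 0.0], [1.0, 0.0]],
}
pi = [2 / 3, 1 / 3]  # stationary distribution over the two states of this example

def word_probability(word, T, pi):
    """P(word): start from pi, apply one labeled matrix per symbol, sum over end states."""
    n = len(pi)
    weights = list(pi)
    for s in word:
        weights = [sum(weights[i] * T[s][i][j] for i in range(n)) for j in range(n)]
    return sum(weights)

c_mu = entropy(pi)
single = [word_probability((s,), T, pi) for s in T]
pairs = [word_probability((a, b), T, pi) for a in T for b in T]
# I[X0; X1] = H[X0] + H[X1] - H[X0, X1], with H[X0] = H[X1] by stationarity.
excess_entropy = 2 * entropy(single) - entropy(pairs)
```

In this example the excess entropy comes out well below the statistical complexity, which is consistent with the inequality of Theorem 5.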
4 Relations to Other Fields

4.1 Computational and Statistical Learning Theory

The goal of computational learning theory (Kearns and Vazirani 1994; Vapnik 1995) is to identify algorithms that quickly, reliably, and simply lead to good representations of a target concept, usually taken to be a dichotomy of a feature or input space. Particular attention is paid to "probably approximately correct" (PAC) procedures (Valiant 1984): those having a high probability of finding a close match to the target concept among members of a fixed representation class. The key word here is "fixed". While taking the representation class as a given is in line with implicit assumptions in most of mathematical statistics, it seems dubious when analyzing learning in the real world (Crutchfield 1994; Boden 1994).



In any case, such an assumption is clearly inappropriate if our goal is pattern discovery, and it was not made in the preceding development. While we plan to make every possible use of the results of computational learning theory in ε-machine reconstruction, we feel this theory is more properly a part of statistical inference and, particularly, of algorithmic parameter estimation, than of pattern discovery per se.

4.2 Formal Language Theory and Grammatical Inference

It is well known that formal languages can be classified into a hierarchy, the higher levels of which have strictly greater expressive power. The denizens of the lowest level of the hierarchy, the regular languages, correspond to finite-state machines and to hidden Markov models of finite dimension. In such cases, relatives of our minimality and uniqueness theorems are well known, and the construction of causal states is analogous to Nerode equivalence classing (Hopcroft and Ullman 1979). Our theorems, however, are not restricted to this setting.

The problem of learning a language from observational data has been extensively studied by linguists and computer scientists. Unfortunately, good learning techniques exist only for the two lowest classes in the hierarchy, the regular and the context-free languages. (For a good account of these procedures see Charniak (1993).) Adapting this work to the reconstruction of ε-machines should be a useful area for future research.

4.3 The Minimum Description-Length Principle

Rissanen's minimum description-length (MDL) principle, best presented in his book (1989), is a way of picking the most concise generative model out of a chosen family of models that are all statistically consistent with given data. The MDL approach starts from Shannon's results on the connection between probability distributions and codes (Shannon and Weaver 1963).



Suppose we choose a class $\mathcal{M}$ of models and are given a data set $x$. The MDL principle tells us to use the model $M \in \mathcal{M}$ that minimizes the sum of the length of the description of $x$ given $M$, plus the length of the description of $M$ given $\mathcal{M}$. The description length of $x$ is taken to be $-\log P(x|M)$. The description length of $M$ may be regarded as either given by some coding scheme or, equivalently, by some distribution over the members of $\mathcal{M}$.
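
The two-part sum can be written out for a toy model class. In the Python sketch below, the class of three IID binary models, the uniform distribution over them, the data, and the use of base-2 logarithms (so that lengths are in bits) are all assumptions made for illustration; the sketch is not Rissanen's own formulation.

```python
import math

def bernoulli_log2_likelihood(x, p):
    """log2 P(x | M) for an IID binary model M with P(1) = p."""
    ones = sum(x)
    zeros = len(x) - ones
    return ones * math.log2(p) + zeros * math.log2(1 - p)

def two_part_length(prior, log2_likelihood):
    """Total description length in bits: -log2 P(M) - log2 P(x | M)."""
    return -math.log2(prior) - log2_likelihood

# Hypothetical data and model class: three Bernoulli models, uniform prior over them.
x = [1, 1, 0, 1, 1, 1, 0, 1]
model_class = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}

lengths = {
    p: two_part_length(prior, bernoulli_log2_likelihood(x, p))
    for p, prior in model_class.items()
}
best = min(lengths, key=lengths.get)
```

With a uniform distribution over the class, the model-description term is the same for every candidate, so the choice is decided by the data term alone; a non-uniform distribution would penalize some models more than others.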

Though the MDL principle was one of the inspirations of computational mechanics, our approach to pattern discovery does not fit within Rissanen's framework. To mention only the most basic differences: We have no fixed class of models $\mathcal{M}$; we do not use encodings of rival models or prior distributions over them; and $C_\mu(\mathcal{R})$ is not a description length.

4.4 Connectionist Models 

Neural networks engaged in unsupervised learning (Becker 1991) are often effectively doing pattern discovery. Certainly Hebb (1948) had this aim in mind when proposing his learning rule. While such networks certainly can discover regularities and covariations, they often represent them in ways baffling to humans. ε-Machines present the structures discovered by reconstruction in a clear and distinct way, but the learning dynamics are not (currently) as well understood as those of neural networks; see Sec. 5.2 below.



5 Concluding Remarks 

5.1 Limitations of the Current Results

We made some restrictive assumptions in our 
development above. Here we mention them in 
order of increasing severity and consider what 
may be done to lift them. 

1. We know exact joint probabilities over sequence blocks of all lengths for a process. The cure for this is ε-machine reconstruction, and in the next subsection we sketch work (underway) on a statistical theory of reconstruction.

2. The observed process takes on discrete 
values. This can probably be addressed with 
only a modest cost in increased mathematical 
subtlety, since the information-theoretic quan- 
tities we have used also exist for continuous 
variables. Many of our results appear to carry 
over to the continuous setting. 

3. The process is discrete in time. This looks similarly solvable, since continuous-time stochastic process theory is moderately well developed. It may involve sophisticated probability theory or functional analysis, however.

4. The process is a pure time series; e.g., without spatial extent. There are already tricks to make spatially extended systems look like time series. Basically, one looks at all the paths through space-time, treating each one as if it were a time series. While this works well for data compression (Lempel and Ziv 1986), it may not be satisfactory for capturing structure (Feldman 1998).

5. The process is stationary. It's unclear 
how best to relax the assumption of station- 
arity. There are several straightforward ways 
of doing so, but it is unclear how much sub- 
stantive content these extensions have. In 
any case, a systematic classification of non- 
stationary processes is (at best) in its infant 
stages. 

5.2 Directions for Future Work 

Two broad avenues for research present them- 
selves. 

First, we have the mathematics of ε-machines themselves. Assumption-lifting extensions have just been mentioned, but there are many other ways to go. One which is especially interesting in the machine-learning context is the trade-off between prescience and complexity. For a given process there is a sequence of optimal machines connecting the one-state, zero-complexity machine with minimal prescience to the ε-machine. Each step on the path is the minimal machine for a certain degree of prescience; it would be very interesting to know what, if anything, we can say in general about the shape of this "prediction frontier".

Second, there is ε-machine reconstruction. As we remarked (Sec. 3.8), there are already several algorithms for reconstructing machines from data. What we need is knowledge of the error statistics (Mayo 1996) of different reconstruction procedures, of the kinds of mistakes they make and the probabilities with which they make them. Ideally, we want to find "confidence regions" for the products of reconstruction: calculating the probabilities of different degrees of reconstruction error for a given volume of data or the amount of data needed to be confident of a fixed bound on the error. An analytical theory has been developed for the expected error in reconstructing certain kinds of processes (Crutchfield and Douglas 1999). The results are encouraging enough that work is underway on a general theory of statistical inference for ε-machines, a theory analogous to what already exists in computational learning theory and grammatical inference.